Fix setting V1 format version for Non-REST catalogs #411

amogh-jahagirdar · 2024-02-11T01:30:32Z

Currently we always set V2 format for create table for non-rest Catalogs regardless of the specified version. This change addresses that.

amogh-jahagirdar · 2024-02-11T01:33:55Z

pyiceberg/table/metadata.py

+        schema = data.get("schema")
+        if schema and isinstance(schema, dict):
+            if "schema_id" not in data["schema"]:
+                data["schema"]["schema_id"] = DEFAULT_SCHEMA_ID


When I explicitly pass in a schema it's no longer a dictionary. Then this validator fails since data["schema"] is an acutal Schema and cannot be referenced. I'm not really familiar with Pydantic but my understanding of this validator is that it's meant for just converting the schema metadata a common representation that can be shared with V2 metadata. I think that should only happen when reading the actual JSON file.

I'm still working through another issue for partition specs in V1 (this error is thrown in the updated test):

pydantic_core._pydantic_core.ValidationError: 2 validation errors for TableMetadataV1 partition_spec.0 Input should be a valid dictionary [type=dict_type, input_value=('spec_id', 0), input_type=tuple] For further information visit https://errors.pydantic.dev/2.6/v/dict_type partition_spec.1 Input should be a valid dictionary [type=dict_type, input_value=('fields', ()), input_type=tuple] For further information visit https://errors.pydantic.dev/2.6/v/dict_type

I'm not sure why this happens for V1 metadata explicitly constructed but V2 works fine.

Hey @amogh-jahagirdar Yes, this is a bit awkward in Pydantic. This is a @model_validator(mode="before") which means that it will always apply the validator before trying to construct the model. This will allow you to correct certain things before applying the checks. This particular function will make sure that the schema-id is set. Where schema_id is required. Keep in mind that this also does have applied the aliases, so both schema_id and schema-id are valid.

The before-mode model validator also be applied when you initialize something off: TableMetadataV1(bla=(1,2,3))

The error that you're seeing is after the mode="before" validators are applied. I would put a breakpoint in the validator, and trace up the stack where the bogus data comes from.

tests/catalog/test_hive.py

Fokko

Thanks for fixing this :)

pyiceberg/table/metadata.py

Fokko · 2024-02-11T01:44:49Z

pyiceberg/table/metadata.py

-            data["schema"]["schema_id"] = DEFAULT_SCHEMA_ID
+        schema = data.get("schema")
+        if schema and isinstance(schema, dict):
+            if "schema_id" not in data["schema"]:


Suggested change

if "schema_id" not in data["schema"]:

if "schema_id" not in schema:

pyiceberg/table/metadata.py

tests/catalog/test_hive.py

Fokko · 2024-02-11T01:56:36Z

pyiceberg/table/metadata.py

+        schema = data.get("schema")
+        if schema and isinstance(schema, dict):
+            if "schema_id" not in data["schema"]:
+                data["schema"]["schema_id"] = DEFAULT_SCHEMA_ID


Hey @amogh-jahagirdar Yes, this is a bit awkward in Pydantic. This is a @model_validator(mode="before") which means that it will always apply the validator before trying to construct the model. This will allow you to correct certain things before applying the checks. This particular function will make sure that the schema-id is set. Where schema_id is required. Keep in mind that this also does have applied the aliases, so both schema_id and schema-id are valid.

The before-mode model validator also be applied when you initialize something off: TableMetadataV1(bla=(1,2,3))

The error that you're seeing is after the mode="before" validators are applied. I would put a breakpoint in the validator, and trace up the stack where the bogus data comes from.

pyiceberg/table/metadata.py

Fokko

Thanks for working on this during the weekend. This looks good, I just have some concerns about setting the schema and partition-spec using the before-validator.

pyiceberg/table/metadata.py

Fokko · 2024-02-11T19:24:06Z

pyiceberg/table/metadata.py

@@ -313,6 +315,34 @@ def construct_partition_specs(cls, data: Dict[str, Any]) -> Dict[str, Any]:

        return data

+    @model_validator(mode="before")
+    def construct_v1_spec_from_v2_fields(cls, data: Dict[str, Any]) -> Dict[str, Any]:


I'd rather avoid the before validators to patch in the partition-spec and schema (probably sort-order as well?). I'd much rather see that we explicitly pass in the spec below:

if format_version == 1: return TableMetadataV1( location=location, schema=fresh_schema, last_column_id=fresh_schema.highest_field_id, current_schema_id=fresh_schema.schema_id, + partition_spec=fresh_partition_spec, partition_specs=[fresh_partition_spec], default_spec_id=fresh_partition_spec.spec_id, + sort_order=fresh_sort_order, sort_orders=[fresh_sort_order], default_sort_order_id=fresh_sort_order.order_id, properties=properties, last_partition_id=fresh_partition_spec.last_assigned_field_id, table_uuid=table_uuid, )

My biggest concern is again that this is a before validator, and we don't take into account the aliases (partition-specs, partition-spec-id, etc).

It looks like Java just ignores the schema if there is a schemas: https://github.com/apache/iceberg/blob/5f577f1b9902ffe6181897a439686a31fc81b89a/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L343-L368

Yeah it just felt a bit awkward to pass in the fields in a redundant way. As long as the source of truth is correct when reading the fields, which it is as you saw for schemas then we're good. I've updated.

Also PartitionSpec on the V1 metadata is a list[dict[str, any]], right now I work with that to preserve compatibility but that does make things a bit awkward since we have to pass in the model_dump of partition spec instead of just the normal PartitionSpec.

Actually @HonahX had a good catch here, it's not just a dump of partition spec, the List[dict[str, any]] is a list of the fields themselves based on the spec. I updated test_metadata. Also, The tests in #245 exercise this path more so going forward so we should have more coverage.

HonahX

Thanks for taking this! @amogh-jahagirdar It's great that we no longer require extra before validator for supporting the v1 metadata. Setting the relevant fields explicitly is a smart approach

pyiceberg/table/metadata.py

amogh-jahagirdar · 2024-02-12T04:31:23Z

Thanks for taking this! @amogh-jahagirdar It's great that we no longer require extra before validator for supporting the v1 metadata. Setting the relevant fields explicitly is a smart approach

Thanks but I don't know about that tbh. I think it's a pretty bad API that we are somewhat forced to specify all fields even though it's redundant for V1 metadata (because it's deprecated so we want to specify the new fields, but partition-spec is still required). We should revisit that imo, but I think that can be done in a follow on PR since the goal here is to solve the original problem of not being able to create V1 tables in non-rest catalogs. Also doesn't really matter in practice too much since constructing new metadata is done behind the catalog anyways which is the API the vast majority of people will be interacting with.

HonahX

LGTM!

I think it's a pretty bad API that we are somewhat forced to specify all fields even though it's redundant for V1 metadata

I guess that's the case with maintaining backward compatibility :(. Fortunately, as you mentioned, this is behind new_table_metadata and catalog, thereby not requiring users to specify any redundant fields.

I thought pydantic's before validator might help, but as we saw in this PR, it's more effective for parsing raw data than for constructing models directly. So, it looks like our current approach is the best we have right now.

We should revisit that ...

Totally agree! For the time being, this approach should work fine. We may need to revisit it when we need to deprecate some fields or introduce V3 metadata.

Fokko

Thanks @amogh-jahagirdar for fixing this over the weekend, and @HonahX for jumping in for the review 🙌

amogh-jahagirdar commented Feb 11, 2024

View reviewed changes

tests/catalog/test_hive.py Outdated Show resolved Hide resolved

Fokko reviewed Feb 11, 2024

View reviewed changes

Fokko added this to the PyIceberg 0.6.0 release milestone Feb 11, 2024

amogh-jahagirdar force-pushed the fix-format-version-metastore-catalogs branch 5 times, most recently from 8c5f085 to e3f5dcb Compare February 11, 2024 16:43

amogh-jahagirdar commented Feb 11, 2024

View reviewed changes

pyiceberg/table/metadata.py Outdated Show resolved Hide resolved

amogh-jahagirdar force-pushed the fix-format-version-metastore-catalogs branch 3 times, most recently from 4d8e486 to d0c964f Compare February 11, 2024 18:40

amogh-jahagirdar commented Feb 11, 2024

View reviewed changes

pyiceberg/table/metadata.py Outdated Show resolved Hide resolved

amogh-jahagirdar requested a review from Fokko February 11, 2024 18:51

amogh-jahagirdar force-pushed the fix-format-version-metastore-catalogs branch from d0c964f to 2354919 Compare February 11, 2024 18:53

Fokko reviewed Feb 11, 2024

View reviewed changes

amogh-jahagirdar force-pushed the fix-format-version-metastore-catalogs branch 5 times, most recently from 7b3aa35 to 9fc37fe Compare February 12, 2024 00:42

amogh-jahagirdar requested a review from Fokko February 12, 2024 00:54

HonahX reviewed Feb 12, 2024

View reviewed changes

pyiceberg/table/metadata.py Outdated Show resolved Hide resolved

pyiceberg/table/metadata.py Outdated Show resolved Hide resolved

Fix setting V1 format version for Non-REST catalogs

4dbc2de

amogh-jahagirdar force-pushed the fix-format-version-metastore-catalogs branch from 9fc37fe to 4dbc2de Compare February 12, 2024 04:40

HonahX approved these changes Feb 12, 2024

View reviewed changes

Fokko approved these changes Feb 12, 2024

View reviewed changes

Fokko merged commit dab5d76 into apache:main Feb 12, 2024
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix setting V1 format version for Non-REST catalogs #411

Fix setting V1 format version for Non-REST catalogs #411

amogh-jahagirdar commented Feb 11, 2024

amogh-jahagirdar Feb 11, 2024 •

edited

Loading

Fokko Feb 11, 2024

Fokko left a comment

Fokko Feb 11, 2024

Fokko Feb 11, 2024

Fokko left a comment

Fokko Feb 11, 2024

amogh-jahagirdar Feb 12, 2024

amogh-jahagirdar Feb 12, 2024 •

edited

Loading

HonahX left a comment •

edited

Loading

amogh-jahagirdar commented Feb 12, 2024 •

edited

Loading

HonahX left a comment •

edited

Loading

Fokko left a comment

	if "schema_id" not in data["schema"]:
	if "schema_id" not in schema:

Fix setting V1 format version for Non-REST catalogs #411

Fix setting V1 format version for Non-REST catalogs #411

Conversation

amogh-jahagirdar commented Feb 11, 2024

amogh-jahagirdar Feb 11, 2024 • edited Loading

Choose a reason for hiding this comment

Fokko Feb 11, 2024

Choose a reason for hiding this comment

Fokko left a comment

Choose a reason for hiding this comment

Fokko Feb 11, 2024

Choose a reason for hiding this comment

Fokko Feb 11, 2024

Choose a reason for hiding this comment

Fokko left a comment

Choose a reason for hiding this comment

Fokko Feb 11, 2024

Choose a reason for hiding this comment

amogh-jahagirdar Feb 12, 2024

Choose a reason for hiding this comment

amogh-jahagirdar Feb 12, 2024 • edited Loading

Choose a reason for hiding this comment

HonahX left a comment • edited Loading

Choose a reason for hiding this comment

amogh-jahagirdar commented Feb 12, 2024 • edited Loading

HonahX left a comment • edited Loading

Choose a reason for hiding this comment

Fokko left a comment

Choose a reason for hiding this comment

amogh-jahagirdar Feb 11, 2024 •

edited

Loading

amogh-jahagirdar Feb 12, 2024 •

edited

Loading

HonahX left a comment •

edited

Loading

amogh-jahagirdar commented Feb 12, 2024 •

edited

Loading

HonahX left a comment •

edited

Loading